Show the code
import pandas as pd
import numpy as np
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)Course DS 250
[Tanner Hamblin]
A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)
The following graphs show that the best month to fly is december to avoid delays is september. The most common type of delay you are likley to encounter is weather delays. The airport out of the list with the lowest average delay is SAN with a delay of .79 hours. The worst is Orlando with an average delay of 1.13 hours.
Fix all of the varied missing data types in the data to be consistent (all missing values should be displayed as “NaN”). In your report include one record example (one row) from your new data, in the raw JSON format. Your example should display the “NaN” for at least one missing value.__
type your results and analysis here
Dropped any outliers or missing data replacing them with the value NAN
df.num_of_delays_late_aircraft.replace(-999,np.nan,inplace = True)
df.num_of_delays_carrier.replace("1500+",1500,inplace = True)
df.airport_name.replace("",np.nan,inplace = True)
df.month.replace("n/a",np.nan,inplace = True)
df.minutes_delayed_nas.replace(-999.0,np.nan,inplace = True)
df.num_of_delays_late_aircraft.value_counts(dropna=False)
df['num_of_delays_carrier'] = df['num_of_delays_carrier'].astype(int)
df.month.replace("Febuary","February",inplace = True)
month_order = ["January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December"]
df['month'] = pd.Categorical(df['month'],categories=month_order, ordered=True)
df.head(1) | airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | num_of_delays_total | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | January | 2005.0 | 35048 | 1500 | NaN | 4598 | 10 | 448 | 8355 | 116423.0 | 104415 | 207467.0 | 297 | 36931 | 465533 |
Which airport has the worst delays? Describe the metric you chose, and why you chose it to determine the “worst” airport. Your answer should include a summary table that lists (for each airport) the total number of flights, total number of delayed flights, proportion of delayed flights, and average delay time in hours.
I used the metric of sorting the flights by airport and using a metric I found the total flights per airport as well as the total delayed flights by airports. using these I found the proportion of delay time per flight. SFO sanfrancisco has the worst proportion with .261 of the flights having delays. Then I found the average delay time and we can see that Orlando has the worst average delay time of 1.13 hours.
df2 = df.groupby("airport_code").agg(
total_flights = ('num_of_flights_total', 'sum'),
total_delayed_flights = ('num_of_delays_total', 'sum'),
sum_delay_time = ('minutes_delayed_total', 'sum')
)
df2['sum_delay_time'] = df2['sum_delay_time'] / 60
df2['prop_delayed_flight'] = round(df2['total_delayed_flights'] / df2['total_flights'],3)
df2['avg_delay_time'] = round(df2['sum_delay_time'] / df2['total_delayed_flights'],2)
df2
# df2.columns| total_flights | total_delayed_flights | sum_delay_time | prop_delayed_flight | avg_delay_time | |
|---|---|---|---|---|---|
| airport_code | |||||
| ATL | 4430047 | 902443 | 899732.100000 | 0.204 | 1.00 |
| DEN | 2513974 | 468519 | 419556.350000 | 0.186 | 0.90 |
| IAD | 851571 | 168467 | 171391.300000 | 0.198 | 1.02 |
| ORD | 3597588 | 830825 | 939268.816667 | 0.231 | 1.13 |
| SAN | 917862 | 175132 | 137937.466667 | 0.191 | 0.79 |
| SFO | 1630945 | 425604 | 442508.216667 | 0.261 | 1.04 |
| SLC | 1403384 | 205160 | 168722.850000 | 0.146 | 0.82 |
What is the best month to fly if you want to avoid delays of any length? Describe the metric you chose and why you chose it to calculate your answer. Include one chart to help support your answer, with the x-axis ordered by month. (To answer this question, you will need to remove any rows that are missing the Month variable.)
I found the proportion of flight delays and total delayed flights by each month. we find that July has the most delayed flights with 319960 in the month of july.
| month | total_flights | total_delayed_flights | total_flight_by_delay | |
|---|---|---|---|---|
| 0 | January | 1193018 | 265001 | 0.22 |
| 1 | February | 1115814 | 248033 | 0.22 |
| 2 | March | 1213370 | 250142 | 0.21 |
| 3 | April | 1259723 | 231408 | 0.18 |
| 4 | May | 1227795 | 233494 | 0.19 |
| 5 | June | 1305663 | 317895 | 0.24 |
| 6 | July | 1371741 | 319960 | 0.23 |
| 7 | August | 1335158 | 279699 | 0.21 |
| 8 | September | 1227208 | 201905 | 0.16 |
| 9 | October | 1301612 | 235166 | 0.18 |
| 10 | November | 1185434 | 197768 | 0.17 |
| 11 | December | 1180278 | 303133 | 0.26 |
The following graph shows us that the proportion of delays to flights showing the chances of you experiencing a delay on your flight is lowest in the month of september.
p = (ggplot(df3, aes(x='month', y='total_flight_by_delay', )) +
geom_bar(stat = 'identity', color = 'black', fill = 'white', width = .85) +
xlab('Month') +
ylab('% of Flights Delayed') +
labs(
title="Chance of Experiencing Any Delay by Month"
)+
theme_classic() +
theme(
axis_text_y= element_text(),
plot_title= element_text(angle=90, size=20),
panel_background= element_rect(fill = "gray")
)
)
p.show()According to the BTS website, the “Weather” category only accounts for severe weather delays. Mild weather delays are not counted in the “Weather” category, but are actually included in both the “NAS” and “Late-Arriving Aircraft” categories. Your job is to create a new column that calculates the total number of flights delayed by weather (both severe and mild). You will need to replace all the missing values in the Late Aircraft variable with the mean. Show your work by printing the first 5 rows of data in a table. Use these three rules for your calculations:
a. 100% of delayed flights in the Weather category are due to weather
a. 30% of all delayed flights in the Late-Arriving category are due to weather
a. From April to August, 40% of delayed flights in the NAS category are due to weather. The rest of the months, the proportion rises to 65%
The following shows the first 5 rows of airport codes. They show that the total weather delays that are not severe. we see that the orlando has the most minor weather delays.
| airport_code | airport_name | month | year | num_of_flights_total | num_of_delays_carrier | num_of_delays_late_aircraft | num_of_delays_nas | num_of_delays_security | num_of_delays_weather | num_of_delays_total | minutes_delayed_carrier | minutes_delayed_late_aircraft | minutes_delayed_nas | minutes_delayed_security | minutes_delayed_weather | minutes_delayed_total | total_weather_delay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ATL | Atlanta, GA: Hartsfield-Jackson Atlanta Intern... | January | 2005.0 | 35048 | 1500 | 1109.104072 | 4598 | 10 | 448 | 8355 | 116423.0 | 104415 | 207467.0 | 297 | 36931 | 465533 | 2535.435294 |
| 1 | DEN | Denver, CO: Denver International | January | 2005.0 | 12687 | 1041 | 928.000000 | 935 | 11 | 233 | 3153 | 53537.0 | 70301 | 36817.0 | 363 | 21779 | 182797 | 2300.000000 |
| 2 | IAD | NaN | January | 2005.0 | 12381 | 414 | 1058.000000 | 895 | 4 | 61 | 2430 | NaN | 70919 | 35660.0 | 208 | 4497 | 134881 | 2469.000000 |
| 3 | ORD | Chicago, IL: Chicago O'Hare International | January | 2005.0 | 28194 | 1197 | 2255.000000 | 5415 | 5 | 306 | 9178 | 88691.0 | 160811 | 364382.0 | 151 | 24859 | 638894 | 4025.100000 |
| 4 | SAN | San Diego, CA: San Diego International | January | 2005.0 | 7283 | 572 | 680.000000 | 638 | 7 | 56 | 1952 | 27436.0 | 38445 | 21127.0 | 218 | 4326 | 91552 | 1977.600000 |
Using the new weather variable calculated above, create a barplot showing the proportion of all flights that are delayed by weather at each airport. Describe what you learn from this graph.
The following results show that your chances of encountering a delay whether it is from weather, security, or carrier all being factored in is the highest at IAD airport and lowest at ATL airport getting as low as 10% chance.
df4bar = df4.groupby('airport_code').agg(
total_flights = ('num_of_flights_total', 'sum'),
total_delayed_flights = ('total_weather_delay', 'sum'),
num_of_delays_carrier = ("num_of_delays_carrier", 'sum'),
num_of_delays_security = ("num_of_delays_security", 'sum'),
).reset_index()
df4bar['prop_of_delays_weather'] = round(df4bar['total_delayed_flights'] / df4bar['total_flights'],2)
df4bar['prop_of_delays_carrier'] = round(df4bar['num_of_delays_carrier'] / df4bar['total_flights'],2)
df4bar['prop_of_delays_security'] = round(df4bar['num_of_delays_security'] / df4bar['total_flights'],2)
df4bar| airport_code | total_flights | total_delayed_flights | num_of_delays_carrier | num_of_delays_security | prop_of_delays_weather | prop_of_delays_carrier | prop_of_delays_security | |
|---|---|---|---|---|---|---|---|---|
| 0 | ATL | 4430047 | 450637.470588 | 173075 | 833 | 0.10 | 0.04 | 0.0 |
| 1 | DEN | 2513974 | 378356.500000 | 121885 | 985 | 0.15 | 0.05 | 0.0 |
| 2 | IAD | 851571 | 228643.300000 | 47761 | 272 | 0.27 | 0.06 | 0.0 |
| 3 | ORD | 3597588 | 495550.741176 | 146491 | 862 | 0.14 | 0.04 | 0.0 |
| 4 | SAN | 917862 | 236136.500000 | 57130 | 490 | 0.26 | 0.06 | 0.0 |
| 5 | SFO | 1630945 | 305776.200000 | 86417 | 697 | 0.19 | 0.05 | 0.0 |
| 6 | SLC | 1403384 | 247641.500000 | 65033 | 867 | 0.18 | 0.05 | 0.0 |
p = (ggplot(df4bar, aes(x='airport_code', y='prop_of_delays_weather', )) +
geom_bar(stat = 'identity', color = 'black', fill = 'white', width = .85) +
xlab('airport_code') +
ylab('% of Flights Delayed') +
labs(
title="Chance of Experiencing Any Delay by Airport"
)+
theme_classic() +
theme(
axis_text_y= element_text(),
plot_title= element_text(angle=90, size=20),
panel_background= element_rect(fill = "gray")
)
)
p.show()Which delay is the worst delay? Create a similar analysis as above for Weahter Delay with: Carrier Delay and Security Delay. Compare the proportion of delay for each of the three categories in a Chart and a Table. Describe your results.
The following shows the delays but all types are separated into the categories weather, carrier, and security. We can see that carrier delays are highest in IAD and SAN airports. And the highest chance of encountering a delay from weather is also IAD and SAN.
df5 = pd.melt(
df4bar,
id_vars= 'airport_code',
value_vars= ['prop_of_delays_weather', 'prop_of_delays_carrier', 'prop_of_delays_security'],
var_name= 'delay_type',
value_name= 'proportion'
)
p = (
ggplot(
df5,
aes(x='airport_code',
y='proportion',
fill='delay_type') # <─ aesthetic mapping goes here
)
+ geom_bar(
stat='identity',
colour='black',
width=.85,
position='dodge' # side-by-side bars by delay type
)
+ xlab('Airport Code')
+ ylab('Proportion of Flights Delayed')
+ ggtitle('Chance of Experiencing Any Delay by Airport')
+ theme_classic()
+ theme(
axis_text_x=element_text(angle=45, hjust=1), # tilt if labels overlap
plot_title=element_text(size=20),
panel_background=element_rect(fill="gray")
)
)
p.show()